
Add an idempotency key to trace/metrics/logs Export*ServiceRequest #219

Closed
jmacd wants to merge 5 commits into open-telemetry:master from jmacd:jmacd/idempotency

Conversation

@jmacd jmacd (Contributor) commented Sep 3, 2020

Adds an idempotency key as described in #218. This is to support AggregationTemporality=DELTA correctly in systems that ingest and store DELTA data points.

Fixes #218.

@jmacd jmacd requested a review from a team September 3, 2020 00:04
@tigrannajaryan tigrannajaryan (Member) left a comment

I wonder if we want to add this to OTLP in general. Idempotency is a good property for preventing duplicate data, and a good companion for protocols/implementations that err on the side of retrying (which OTLP does, via some nuanced sentences in the spec).

Comment thread on opentelemetry/proto/collector/metrics/v1/metrics_service.proto (outdated)
@jmacd jmacd changed the title Add an idempotency key to ExportMetricsServiceRequest Add an idempotency key to trace/metrics/logs Export*ServiceRequest Sep 3, 2020
```proto
//
// A V4 UUID (RFC 4122, https://tools.ietf.org/html/rfc4122) is the
// recommended idempotency key format.
string idempotency_key = 2;
```
@tigrannajaryan tigrannajaryan (Member) commented Sep 3, 2020

I see you changed it from bytes to string. This is probably OK, but it is bigger on the wire. We could keep the bytes, so that it is typically 16 bytes of UUID on the wire instead of 36 bytes in string representation.

If it is a string, then this comment probably needs to be re-worded, since it refers to bytes and it is not clear what that means:

> Values of size less than 8 bytes are invalid and MUST be ignored.

My slight preference is to keep it as bytes and recommend V4 UUID in the bytes.
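
For concreteness, a minimal Go sketch of the size difference, assuming the google/uuid package (any RFC 4122 implementation would do):

```go
package main

import (
	"fmt"

	"github.com/google/uuid" // assumed stand-in for any RFC 4122 library
)

func main() {
	key := uuid.New() // random (V4) UUID

	raw := key[:]        // 16 bytes, as carried by a proto `bytes` field
	text := key.String() // canonical textual form

	fmt.Println(len(raw), len(text)) // 16 36
}
```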

@jmacd jmacd (Contributor, Author) commented Sep 4, 2020

OK. I have seen 8-byte keys in use, so I was hesitant to require a 16-byte key. I looked at the benchmarks in https://github.com/satori/go.uuid, which suggest a UUIDv4 takes less than 1 microsecond to generate, so that seems like not a big deal. I've reworded this (in the three files) to say:

```proto
  // OTLP clients SHOULD include an idempotency key.  Empty
  // idempotency keys are invalid and MUST be ignored.  A 128-bit V4
  // UUID (https://tools.ietf.org/html/rfc4122) is the
  // recommended idempotency key format.
  bytes idempotency_key = 2;
```
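
For illustration, a sketch of how a client might use the proposed field, assuming Go bindings were generated with an `IdempotencyKey` field (hypothetical: this field was never merged into the OTLP protos). The essential property is that the same key is reused across retries of the same request:

```go
package export

import (
	"context"

	"github.com/google/uuid"

	metricsv1 "go.opentelemetry.io/proto/otlp/collector/metrics/v1"
)

// exportWithRetry sends one request, reusing the same idempotency key on
// every retry so the server can recognize redelivery of the same data.
// req.IdempotencyKey is the field proposed in this PR, not a released
// OTLP field.
func exportWithRetry(ctx context.Context, client metricsv1.MetricsServiceClient,
	req *metricsv1.ExportMetricsServiceRequest) error {
	key := uuid.New()
	req.IdempotencyKey = key[:] // 16 raw bytes, generated once per request

	var err error
	for attempt := 0; attempt < 3; attempt++ {
		if _, err = client.Export(ctx, req); err == nil {
			return nil
		}
		// Retrying with the SAME key makes a duplicate delivery detectable.
	}
	return err
}
```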

@tigrannajaryan tigrannajaryan (Member) commented

I am fine with only recommending 16-byte UUIDs rather than requiring them. My concern was primarily around using bytes vs string as the data type.

@tigrannajaryan tigrannajaryan (Member) commented

@jmacd I know I already gave my approval but I have some concerns. I have been thinking about this since yesterday and haven't managed to convince myself. Can you please help me understand?

If I understand correctly, the problem this solves is duplicate data on the server side. What I do not quite understand is how important it is to avoid the duplicates. Are we looking for a guaranteed lack of duplicates or only for a reduction in the number of duplicates?

The solution as proposed works at the request level. For intermediary Collectors this has limited use. Collectors can be (and in production often are) behind a load balancer, with no guarantee that retried requests go to the same Collector instance. This makes detection either impossible (if we store the idempotency keys locally per instance) or requires Collector instances to coordinate and exchange idempotency keys (which is complicated, and I am not sure we will actually implement it in the Collector). If we don't implement idempotency on the receiving end of the Collector, then I am not sure of the overall usefulness of this concept in a real production telemetry collection pipeline.

I am not sure if the solution in this PR seeks to prevent duplicates or to guarantee exactly-once delivery. If it is the former, then we can possibly achieve it in a different way: by changing the retry logic in OTLP to never retry in situations which may result in duplicates (particularly in all situations where the server does not explicitly confirm that it did not receive the data). This could be an operation mode for OTLP (as opposed to the currently specified operation, which aims for at-least-once). Obviously this increases the risk of data loss, so it is a tradeoff.

If we are looking for exactly-once guarantees, that's a different story, and I don't think this proposal solves it.

In the presence of intermediary nodes and multiple delivery routes, request-level measures that operate on a single hop of communication are not sufficient to implement delivery guarantees. Such guarantees likely require ids/keys that travel together with the telemetry data end-to-end, intact, through multiple hops of the collection pipeline. This proposal does not provide the means for that.
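
To make the per-instance limitation concrete, here is a minimal Go sketch of request-level deduplication state (an illustration, not Collector code); every instance behind the load balancer would hold its own copy, so a retry routed to a different instance goes undetected:

```go
package dedupe

import (
	"sync"
	"time"
)

// seenKeys is per-process state: a second Collector instance behind the
// same load balancer holds its own copy and cannot see keys recorded here.
type seenKeys struct {
	mu   sync.Mutex
	keys map[string]time.Time // idempotency key -> first-seen time
	ttl  time.Duration
}

func newSeenKeys(ttl time.Duration) *seenKeys {
	return &seenKeys{keys: make(map[string]time.Time), ttl: ttl}
}

// observe reports whether key was already seen within the TTL window,
// recording it as seen otherwise.
func (s *seenKeys) observe(key string) (duplicate bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if t, ok := s.keys[key]; ok && time.Since(t) < s.ttl {
		return true
	}
	s.keys[key] = time.Now()
	return false
}
```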

@jmacd jmacd (Contributor, Author) commented Sep 8, 2020

@tigrannajaryan this proposal is very limited, as you pointed out. Let me illustrate, for the moment, how I was imagining this can be used. Suppose the collector simply passes through each OTLP request, so that one request in equals one request out. The collector can modify the data (filter, rename, etc.), but the output corresponds to 100% of the input and uses the same idempotency key.

The OTLP request will eventually be received by an endpoint that intends to store the data, at which point I presume the data from the original request will be split into individual units (e.g., metric data points, log events, spans) before writing. Each individual unit of data keeps the same idempotency key, and the final destination of this data pipeline stores each unit with its associated idempotency key. When another individual unit of data arrives, it goes to the same storage location, at which point the storage layer can recognize a duplicate point by its idempotency key.

Generally this results in inflated storage costs for recent data; commonly this is addressed by rewriting data after the point at which no new data will be accepted, dropping idempotency keys at that time.
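
A minimal Go sketch of the storage-side deduplication step described above, with illustrative names (not from any OTLP schema):

```go
package store

import "bytes"

// pointID identifies where a data point lands in storage; the fields are
// illustrative stand-ins for a real series/timestamp identity.
type pointID struct {
	series       string // e.g. metric name plus an attribute-set hash
	timeUnixNano uint64
}

// recentKeys holds the idempotency key written with each recent point.
// After the ingestion window closes, data is rewritten and keys dropped,
// reclaiming the extra storage.
type recentKeys map[pointID][]byte

// isDuplicate reports whether a point with this identity already arrived
// carrying the same idempotency key, i.e. a redelivery of the same request.
func (r recentKeys) isDuplicate(id pointID, key []byte) bool {
	prev, ok := r[id]
	return ok && bytes.Equal(prev, key)
}

// record remembers the key for a newly written point.
func (r recentKeys) record(id pointID, key []byte) {
	r[id] = key
}
```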

The approach above works in the presence of multiple routes and a horizontally-scaled collector pool, but the one-request-to-one-request property limits what the collector can do, particularly its ability to correctly batch data.

Correcting this problem in general could take any number of forms. This is where we could begin to coordinate an exchange of idempotency keys between collectors. A more straightforward system architecture would place a Kafka queue between the producer of individual requests and the consumer that batches them. The consumer then has the proper state to deduplicate requests by their idempotency key. Correctly using a collector pipeline that batches or joins requests in some way would require this sort of solution.

I am interested in this simple form of idempotency key because it makes simple things possible, while we could continue to study the ways that more general forms of idempotency can be arranged.

@tigrannajaryan tigrannajaryan (Member) commented

> Suppose the collector simply passes through each OTLP request, so that one request in equals one request out.

This is not how the Collector works today. The Collector is not designed to pass requests through; it unpacks each request at the receiving end and drops any portion of the request Protobuf message that is not the telemetry data itself. We don't have a good way of preserving the request's idempotency key in the Collector pipeline.

> The approach above works in the presence of multiple routes and a horizontally-scaled collector pool, but the one-request-to-one-request property limits what the collector can do, particularly its ability to correctly batch data.

Yes, it would work if the Collector worked the way you described, but it doesn't at the moment and I don't think we can make it work that way.

I understand that in scenarios where the Collector is not present and you send directly to the backend, your proposal will work and is valuable, but I wonder if that is good enough for your needs (or for the needs of others who may also be interested in preventing duplicates). Do you envision that it will be common for you/your customers to not have a Collector?

@bogdandrutu bogdandrutu (Member) left a comment

Blocking this PR because I feel we don't need it right now (maybe I am wrong), and, as @tigrannajaryan mentioned, things like batching in the collector will not work.

Can we work on this after GA? If it is extremely critical we can discuss it more, but I would like to set the mindset that, unless something is extremely critical, let's delay it until after GA.

@jmacd jmacd (Contributor, Author) commented Dec 23, 2020

Further investigation into the Metrics part of OTLP leads me to think this is not needed after we specify a mandatory service.instance.id resource combined with timestamps in each data point.
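
For illustration, the identity that would make the key unnecessary under that assumption (field names are illustrative, not generated OTLP types):

```go
// With a mandatory service.instance.id resource attribute, a DELTA metric
// point can be identified for deduplication without a request-level key.
type deltaPointIdentity struct {
	serviceInstanceID string // resource attribute service.instance.id
	metricName        string
	startTimeUnixNano uint64 // data point start timestamp
	timeUnixNano      uint64 // data point end timestamp
}
```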
